Checking for missing files in parallel #224

Merged
merged 2 commits into from
Sep 7, 2022

AllenDowney
Contributor

Closes #216

Checking for missing files is slow with goofys, which seems to make lots of small queries to the file system. Running the checks in parallel with pqdm is much faster. The speed depends on the state of the file system cache, but we can check 246,000 files in 5-8 minutes, compared to about two hours when checking them one at a time.

Using pqdm with threads is faster than with processes. Using 16 threads seems to be fast and robust. More threads make the check faster still, but unpredictable I/O errors start to appear.

If an error occurs during the parallel check, the code falls back to the slow, sequential check.
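
For readers who want the shape of the approach, here is a minimal sketch of a threaded existence check with a sequential fallback. It is not the code merged in this PR: it swaps pqdm for the standard library's `ThreadPoolExecutor` so that it runs without extra dependencies, and the function name `find_missing_files` and the `n_threads` parameter are hypothetical. The default of 16 threads mirrors the setting described above.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def find_missing_files(filepaths, n_threads=16):
    """Return the paths in `filepaths` that do not exist.

    Hypothetical helper, not zamba's actual implementation.
    """
    paths = [Path(p) for p in filepaths]
    try:
        # Fast path: existence checks are I/O-bound, so threads can
        # overlap the time spent waiting on the file system.
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            exists = list(pool.map(Path.exists, paths))
    except Exception:
        # Fallback: if the parallel check raises (for example, an
        # I/O error), check the files one at a time instead.
        exists = [p.exists() for p in paths]
    return [p for p, ok in zip(paths, exists) if not ok]
```

Calling `find_missing_files(["vid1.mp4", "vid2.mp4"])` returns a list of `Path` objects for whichever of the two files is absent, in input order.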

This fix has only been tested with video files that are mounted from S3 using goofys. It might be good to test with videos stored in a local file system, too.

@netlify

netlify bot commented Sep 7, 2022

Deploy Preview for silly-keller-664934 ready!

Name Link
🔨 Latest commit b5fd541
🔍 Latest deploy log https://app.netlify.com/sites/silly-keller-664934/deploys/6318b9d4cd2d0a0008a9ee62
😎 Deploy Preview https://deploy-preview-224--silly-keller-664934.netlify.app

@codecov-commenter

codecov-commenter commented Sep 7, 2022

Codecov Report

Merging #224 (b5fd541) into master (17d291b) will decrease coverage by 0.0%.
The diff coverage is 77.7%.

@@           Coverage Diff            @@
##           master    #224     +/-   ##
========================================
- Coverage    87.0%   87.0%   -0.1%     
========================================
  Files          29      29             
  Lines        1930    1937      +7     
========================================
+ Hits         1681    1686      +5     
- Misses        249     251      +2     
Impacted Files | Coverage Δ
zamba/models/config.py | 96.7% <77.7%> (-0.6%) ⬇️

@AllenDowney AllenDowney requested a review from ejm714 September 7, 2022 15:50
Collaborator

@ejm714 ejm714 left a comment

Code looks great and 5-6 minutes instead of 2 hours is a fantastic improvement!

I agree it's worth testing with local files as well. Can you do that? You can mock it by downloading one sample video, making 100 copies or so (a little bash script would do), and then creating a simple labels file, e.g.

filepath,label
vid1.mp4,gorilla
vid2.mp4,gorilla
vid3.mp4,gorilla
vid4.mp4,gorilla
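
One way to build such a mock dataset is sketched below in Python rather than bash. The helper name `make_mock_dataset`, the output file name `labels.csv`, and the `vidN` naming scheme are assumptions made for illustration; only the `filepath,label` columns and the `gorilla` label come from the example above.

```python
import csv
import shutil
from pathlib import Path


def make_mock_dataset(sample_video, out_dir, n_copies=100, label="gorilla"):
    """Copy one sample video n_copies times and write a labels file.

    Hypothetical helper for local testing; returns the labels file path.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    suffix = Path(sample_video).suffix
    labels_path = out_dir / "labels.csv"
    with open(labels_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filepath", "label"])
        for i in range(1, n_copies + 1):
            name = f"vid{i}{suffix}"
            # Each copy is a byte-for-byte duplicate of the sample.
            shutil.copyfile(sample_video, out_dir / name)
            writer.writerow([name, label])
    return labels_path
```

The file paths written to the labels file are relative to `out_dir`, so the check should be run with that directory as the data directory.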

@AllenDowney AllenDowney changed the title Checking for missing miles in parallel Checking for missing files in parallel Sep 7, 2022
@AllenDowney
Contributor Author

Confirmed that it works with 22 local files (a convenience sample of videos).

@AllenDowney AllenDowney requested a review from ejm714 September 7, 2022 18:30
Collaborator

@ejm714 ejm714 left a comment

Thanks @AllenDowney!

@ejm714 ejm714 merged commit 570f9cc into master Sep 7, 2022
@ejm714 ejm714 deleted the fast-file-check branch September 7, 2022 19:47
Development

Successfully merging this pull request may close these issues.

Speed up or eliminate checking filepaths
3 participants